Abstract

This paper describes a system-level approach to improve the
area and delay of datapath designs that perform polynomial
computations over $Z_{2^m}$, which are used in many
applications such as computer graphics and digital signal
processing. The approach reduces the number of arithmetic
operations needed to implement multivariate polynomial
systems by optimizing them at the system level prior to
high-level synthesis. It relies on univariate functional
decomposition of polynomial expressions and on the canonical
form over $Z_{2^m}$. We use the GAUT high-level synthesis
tool to generate RTL datapath architectures for the optimized
polynomials. Experimental results on a set of benchmark
applications with polynomial expressions show that this method
outperforms conventional methods in terms of the area of the
sequential datapath architectures in speed optimization mode,
with an average improvement of 25.81%, and in terms of the
required clock cycles in the speed optimization and area
optimization modes, with average improvements of 23.48% and
38.24%, respectively.


Keywords 

High-level synthesis, system-level transformations, register 
transfer level (RTL), polynomial datapath, univariate 
functional decomposition, canonization form 


1. Introduction 

As the complexity and size of modern embedded applications
continuously increase, designing hardware at higher levels of
abstraction, for faster design adjustments and higher
simulation speed, is necessary. Conventional high-level
synthesis techniques do not efficiently eliminate redundancy
and common sub-expressions in polynomial datapaths over
$Z_{2^m}$. Such polynomial functions have so far been
optimized manually to achieve an efficient
register-transfer-level (RTL) implementation. This process can
be time consuming and error prone. Hence, developing
high-level synthesis and optimization techniques to automate
the design of custom polynomial datapaths from a behavioral
description is desirable.


The Horner form of a polynomial expression is a normal 
form representation using a nested format. This method 
transforms the expression into a sequence of nested additions 
and multiplications, which are suitable for univariate 
polynomials and for sequential machine evaluation using 
multiplier-accumulator units. 
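
The nested evaluation described above can be sketched in a few lines of Python. This is only an illustration; the function name `horner` and the coefficient ordering (highest degree first) are assumptions made for this sketch.

```python
def horner(coeffs, x):
    """Evaluate a univariate polynomial in Horner (nested) form.

    coeffs lists the coefficients from the highest degree down to the
    constant term. Returns the value and the number of multiplications.
    """
    acc = coeffs[0]
    mults = 0
    for c in coeffs[1:]:
        acc = acc * x + c  # one multiply-accumulate step per coefficient
        mults += 1
    return acc, mults


# x^3 + 2x^2 - x + 5 is evaluated as ((x + 2) * x - 1) * x + 5
value, mults = horner([1, 2, -1, 5], 3)
```

A degree-n polynomial needs n multiplications in this form, which is why it maps well onto a single multiplier-accumulator unit.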


Another algebraic technique is based on kernel/co-kernel
computation [6]. First, the lowest-cost form of each given
polynomial is chosen from among its canonical, square-free
factorized, and original forms. Then common coefficients and
common cubes are extracted using the kernel/co-kernel
extraction technique from [7]. Common sub-expressions are
determined using an algebraic division technique. This method
is only applicable to polynomials in which linear blocks
exist explicitly.


In [7] and [8], a factoring method was proposed that employs
kernel/co-kernel extraction with common sub-expression
elimination to reduce the size of the implementation. The
approximate factorization algorithm presented in [8]
represents an arithmetic function $f$ as a product of
sub-functions $f = f_1 \times f_2 \times \dots \times f_n$,
where each $f_i$ is a multivariate polynomial. However, this
algorithm can factorize only square-free polynomials and
cannot deal with a sub-function $f_i$ of degree higher than
one.


Another algebraic method was proposed in [3] and then
improved in [4]. The main idea is similar to the algebraic
division techniques used in logic synthesis. This technique
tries to decompose the original polynomial $poly$ as
$poly = p_1 \times p_2 + p_3$ such that $p_3$ is minimized.
To do so, all possible initial values of $p_1$ and $p_2$ must
be evaluated. Then, for each initialization, it is necessary
to check whether other monomials in $poly$ can be represented
in the form $p_1 \times p_2$. Finally, the best
initialization, which yields the lowest-complexity $p_3$, is
chosen. The algebraic technique in [2] improves the
optimization heuristics in [3] and [4] to extract more common
sub-expressions by considering single-variable and hidden
monomials. This technique makes use of finite ring algebra
and the Modular Horner Expansion Diagram [5]. The method
first reduces the original polynomials over $Z_{2^m}$. Then
common sub-expressions are extracted based on two heuristics.
The main disadvantage of this technique is that the
decompositions start from the reduced polynomials, whereas
more common sub-expressions could be extracted if the
original polynomials were used.


The algebraic method in [1] proposed, for the first time, a
polynomial optimization technique based on redundancy
addition/removal. The main idea is similar to the logic
optimization based on redundancy addition/removal that has
been developed in the logic synthesis area. In this method,
kernels/co-kernels of the given polynomials are first
extracted as good building blocks. Then a large number of
vanishing polynomials, which are equal to 0 over $Z_{2^m}$,
are generated as redundancy in order to transform the given
polynomials in such a way that more common sub-expressions
can be extracted. Finally, common sub-expressions are
determined using algebraic division.


In the current paper, we introduce system-level techniques
for transforming the given system of polynomials so that it
offers more common sub-expressions. Our optimization method
reduces the complexity of polynomial datapaths in terms of
the number of arithmetic operations by performing
optimization at the system level prior to high-level
synthesis. Furthermore, in order to generate the RTL datapath
architecture for the optimized polynomials, we use the GAUT
high-level synthesis tool [12], although any other high-level
synthesis tool can be utilized. Our optimization method
reduces the area and the number of clock cycles of the RTL
datapath architectures. In this method, we use the
mathematical concept of univariate functional decomposition
of polynomial expressions in order to obtain good building
blocks and hence extract more common sub-expressions.


In summary, our design flow in this paper consists of the 
following tasks: 


• System-level transformations to optimize datapath
  designs that perform polynomial computations over
  $Z_{2^m}$ using univariate functional decomposition and
  canonization form.

• Univariate functional decomposition of the given
  polynomials to obtain good building blocks and extract
  suitable common sub-expressions.

• High-level synthesis using GAUT [12] to generate
  datapath architectures for the optimized polynomials as
  sequential circuits.

• Evaluating the performance of the proposed method and
  showing its effectiveness by comparing it with the
  state-of-the-art polynomial optimization methods in the
  literature.


The remainder of this paper is organized as follows. 
Section 2 introduces some preliminaries which are used in 
the rest of the paper. A motivational example is presented in 
section 3. Section 4 explains, in detail, our proposed 
polynomial optimization method. Section 5 evaluates the 
performance of our algorithms and presents experimental 
results that demonstrate their effectiveness. Finally, section 6 
provides our conclusion. 


2. Preliminaries 

This section introduces some preliminaries which are used in
the rest of the paper. In this paper arithmetic datapaths are
modeled as polynomial functions from
$Z_{2^{n_1}} \times Z_{2^{n_2}} \times \dots \times Z_{2^{n_d}}$
to $Z_{2^m}$ [9]. Let $f_1(X), \dots, f_p(X)$ be $p$ given
polynomial functions from
$Z_{2^{n_1}} \times Z_{2^{n_2}} \times \dots \times Z_{2^{n_d}}$
to $Z_{2^m}$ as the specification, where
$X = \langle x_1, x_2, \dots, x_d \rangle$ is a vector of $d$
input variables and $n_1, n_2, \dots, n_d$ denote the sizes of
the corresponding variables. $Z_{2^n}$ represents the finite
set of integers $\{0, 1, \dots, 2^n - 1\}$, and $m$ is the
size of the output bit-vector.


Theorem 1: Let $f$ be a polynomial function from
$Z_{2^{n_1}} \times \dots \times Z_{2^{n_d}}$ to $Z_{2^m}$.
Then according to [9], $f$ can be uniquely represented in a
canonical form as (1), where $Y_k$ is the falling factorial of
degree $k \in Z$ ($Z$ denotes the ring of integers) and is
defined as follows,

$$Y_0(x) = 1, \quad Y_1(x) = x, \quad Y_2(x) = x(x-1), \quad \dots, \quad Y_k(x) = Y_{k-1}(x) \times (x-k+1).$$

$a_k$ is an integer such that
$0 \le a_k < \frac{2^m}{\gcd(2^m, \prod_{i=1}^{d} k_i!)}$,
$k = \langle k_1, k_2, \dots, k_d \rangle$ for each
$k_i = 1, 2, \dots, \mu_i$, and
$\mu_i = \min\{2^{n_i}, SF(2^m)\}$. $SF(n)$ is the least
$k \in N$ such that $n$ divides $k!$, and denotes the
Smarandache function [10]. $\gcd(x, y)$ computes the greatest
common divisor of $x$ and $y$.

$$f = \sum_k a_k Y_k = \sum_k a_k \times Y_{k_1}(x_1) \times \dots \times Y_{k_d}(x_d) \qquad (1)$$

For example, let $f = 2x^5 + x^4 + x^2 - 2x$; the canonical
form of $f$ over $Z_{2^3}$ is $2x^2$. Note that the canonical
form of a polynomial from
$Z_{2^{n_1}} \times Z_{2^{n_2}} \times \dots \times Z_{2^{n_d}}$
to $Z_{2^m}$ may be zero.
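
For the univariate case, the canonical coefficients can be computed with forward differences, as in the following Python sketch. It is an illustration rather than the procedure of [9]: the function names are hypothetical, the polynomial is given as a low-to-high coefficient list, and the basis is indexed from $Y_0$ up to $Y_{\mu-1}$, which is an assumption of this sketch.

```python
from math import factorial, gcd


def smarandache(n):
    """Least k such that n divides k! (the SF function of Theorem 1)."""
    k, fact = 1, 1
    while fact % n != 0:
        k += 1
        fact *= k
    return k


def canonical_coeffs(coeffs, n_bits, m):
    """Falling-factorial coefficients [a_0, a_1, ...] of a univariate
    polynomial from Z_{2^n_bits} to Z_{2^m}; coeffs[i] multiplies x^i."""
    mod = 1 << m
    mu = min(1 << n_bits, smarandache(mod))
    values = [sum(c * x**i for i, c in enumerate(coeffs)) for x in range(mu)]
    result = []
    for k in range(mu):
        # values[0] now holds the k-th forward difference at 0, i.e. k! * c_k
        c_k = values[0] // factorial(k)
        result.append(c_k % (mod // gcd(mod, factorial(k))))
        values = [values[i + 1] - values[i] for i in range(len(values) - 1)]
    return result


# f = 2x^5 + x^4 + x^2 - 2x over Z_{2^3}
a = canonical_coeffs([0, -2, 1, 0, 1, 2], n_bits=3, m=3)
```

Here `a` is `[0, 2, 2, 0]`, i.e. $2Y_1(x) + 2Y_2(x) = 2x + 2x(x-1) = 2x^2$, which agrees with the example above.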


Definition 1: If $g$ and $h$ are univariate polynomials, then
the univariate polynomial $f(x) = g(x) \circ h(x)$ is their
functional composition, and $(g, h)$ is a univariate
functional decomposition of $f$, where $g$ and $h$ are
polynomials of lower degree than $f$ and are called the left
decomposition factor and the right decomposition factor of
$f$, respectively. $\circ$ is the composition operator, which
computes the output of $g$ when its argument is $h(x)$ instead
of $x$ (i.e., $f(x) = g(x) \circ h(x) = g(h(x))$).


Example 1: Let $f(x) = x^4 + x^2 - 3$; then
$f(x) = g(x) \circ h(x) = (x^2 + x - 3) \circ x^2$ is a
univariate functional decomposition of $f$, where
$g(x) = x^2 + x - 3$ and $h(x) = x^2$.


3. Motivational Example 

In this section, we present an example to motivate the 
optimization technique to be presented. In order to 
demonstrate the effectiveness of the proposed method, let us 
consider the following polynomial system. 


$$f_1(x, y) = x^4 + 2x^3 + x^2 + xy^3 - 3xy^2 + 2xy$$
$$f_2(x) = x^6 + 3x^5 + 3x^4 + x^3 + x^2 + x.$$

This system needs 32 multiplications and 10 additions.


After applying the factorization technique using
MATLAB [11] to these polynomials, $f_1$ and $f_2$ are
transformed to the following forms

$$f_1(x, y) = x(x^3 + 2x^2 + x + y^3 - 3y^2 + 2y)$$
$$f_2(x) = x(x + 1)(x^4 + 2x^3 + x^2 + 1),$$

which need 19 multiplications and 9 additions.


By applying our proposed optimization method over $Z_{2^2}$
to the original polynomials, $f_1$ is converted to the
following form,

$$h(x) = x^2 + x, \quad g(x) = x^2,$$
$$f_1(x, y) = g \circ h + xy^3 - 3xy^2 + 2xy = x^2 \circ (x^2 + x) + xy^3 - 3xy^2 + 2xy;$$

because the canonical form of $xy^3 - 3xy^2 + 2xy$ over
$Z_{2^2}$ is 0,

$$f_1 = x^2 \circ (x^2 + x) = (x^2 + x)^2 = h^2,$$

and $f_2$ is converted as follows.

$$h(x) = x^2 + x, \quad g(x) = x^3 + x,$$
$$f_2(x) = g \circ h = (x^3 + x) \circ (x^2 + x) = (x^2 + x)^3 + x^2 + x = h^3 + h = h(h^2 + 1) = h(f_1 + 1).$$


Figure 1: (a) Datapath architecture of the polynomials, implemented using factorization, (b) Datapath architecture of the polynomials, implemented using our proposed method


The optimized polynomial system requires only 3 
multiplications and 2 additions. We have used GAUT as a 
high-level synthesis tool to generate datapath architectures 
for the polynomial systems. GAUT tool has been used in 
many academic projects, and its HLS algorithms for binding, 
allocation, and scheduling are well documented [12]. 
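
The operation count of the optimized system can be checked with a short Python sketch. It assumes the shared block is $h = x^2 + x$ and covers only the terms in $x$ (the $y$-dependent terms of $f_1$ are taken to vanish, as stated above); the function names are illustrative.

```python
def optimized_system(x):
    """Evaluate f1 and f2 through the shared building block h."""
    h = x * x + x        # multiplication 1, addition 1
    f1 = h * h           # multiplication 2
    f2 = h * (f1 + 1)    # multiplication 3, addition 2
    return f1, f2


def expanded_system(x):
    """The same polynomials in expanded form, for comparison."""
    f1 = x**4 + 2 * x**3 + x**2
    f2 = x**6 + 3 * x**5 + 3 * x**4 + x**3 + x**2 + x
    return f1, f2


# The two evaluations agree over the integers, hence also modulo 2^m.
agree = all(optimized_system(x) == expanded_system(x) for x in range(-8, 9))
```

The comments in `optimized_system` account for exactly 3 multiplications and 2 additions.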


We have used GAUT to generate datapath architectures in two
modes: speed optimization, and area optimization, in which
only one functional unit is considered for each operation
type existing in the design. The datapath architecture of the
polynomials implemented using factorization, in the speed
optimization mode, is shown in Fig. 1(a). The datapath
architecture of the polynomials implemented using our
proposed method is shown in Fig. 1(b).


The results reported by GAUT for the polynomials implemented
using factorization and using our proposed method are shown
in Table 1. We have used the “notch” library provided by
GAUT, and we have set the clock cycle to 20. This table
reports the area and the number of clock cycles, registers,
multiplexers, and functional units (adder, subtracter,
multiplier) in the datapath architectures of the factored
polynomials and of the polynomials optimized using our
proposed method, in the speed optimization and area
optimization modes.


Table 1. GAUT report for the polynomials implemented using
factorization and for the polynomials implemented using our
proposed method, in speed optimization and area optimization
modes.


Mode               | Metric    | Factorization | Proposed Method
Speed optimization | Cycles    | 6             | 6
                   | Registers | 14            | 4
                   | Muxes     | …             | …
                   | FU +      | …             | …
                   | FU -      | 0             | 0
                   | FU ×      | 6             | 1
                   | Area      | 530           | 91
Area optimization  | Cycles    | 22            | 6
                   | Registers | 13            | 4
                   | Muxes     | 320           | 32
                   | FU +      | 1             | 1
                   | FU -      | 0             | 0
                   | FU ×      | 1             | 1
                   | Area      | 91            | 91


4. Proposed System-level Optimization Method 
We introduce system-level techniques for transforming the
given system of polynomials so that it offers more common
sub-expressions. Our optimization method reduces the
complexity of polynomial datapaths in terms of the number of
arithmetic operations by performing optimization at the
system level prior to high-level synthesis. Furthermore, to
generate the datapath architecture for the optimized
polynomials as sequential circuits, we use the GAUT
high-level synthesis tool. Our optimization method reduces
the area and the number of clock cycles of the datapath
architectures.


In the first phase of the proposed system-level optimization
method, each given multivariate polynomial
$f(x_1, \dots, x_d)$ is transformed to several univariate
polynomials by representing $f$ based on each input variable
$x_i$ ($1 \le i \le d$). Then each obtained univariate
polynomial is decomposed through the univariate functional
decomposition algorithm explained in subsection 4.1 in order
to obtain good building blocks. In the second phase, to
extract common sub-expressions among the given polynomials,
we make use of the univariate functional decomposition
algorithm, unlike other works that utilize the algebraic
division technique [1][6][7]. Finally, among the various
forms of the polynomials in terms of the extracted common
sub-expressions, the form with the smallest number of
arithmetic operations is selected. These phases are explained
in more detail in the following subsections.


4.1. Determining Building Blocks (Phase I) 

In this phase, each given multivariate polynomial $f$ is
transformed to several univariate polynomials by
representing $f$ based on each input variable. Then each
obtained univariate polynomial is decomposed through the
univariate functional decomposition algorithm in order to
obtain good building blocks. This phase is explained in the
following steps.


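Step 1: Each given multivariate polynomial
$f(x_1, \dots, x_d)$ is rewritten based on each input
variable $x_i$ ($1 \le i \le d$) as (2).


$$f = \sum_{e_1, \dots, e_{i-1}, e_{i+1}, \dots, e_d \ge 0} x_1^{e_1} \cdots x_{i-1}^{e_{i-1}} \, x_{i+1}^{e_{i+1}} \cdots x_d^{e_d} \times f_{e_1, \dots, e_{i-1}, e_{i+1}, \dots, e_d}(x_i) \qquad (2)$$

where $f_{e_1, \dots, e_{i-1}, e_{i+1}, \dots, e_d}(x_i)$ is a
univariate polynomial which represents the polynomial $f$
based on the variable $x_i$, and
$e_1, \dots, e_{i-1}, e_{i+1}, \dots, e_d$ are the degrees of
the $d-1$ variables $x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_d$
in polynomial $f$.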

After applying this transformation to all given polynomials
$f_j$ ($1 \le j \le p$), where $p$ is the number of given
polynomials, all obtained univariate polynomials
$f_{e_1, \dots, e_{i-1}, e_{i+1}, \dots, e_d}(x_i)$
($1 \le i \le d$) from all $f_j$ are stored in a set named
$AllSubPoly_{x_i}$ ($1 \le i \le d$).
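
Step 1 can be sketched in Python as a grouping of monomials. The sparse representation (a dictionary from exponent tuples to coefficients), the function name `split_by_variable`, and the small polynomial used below are assumptions made for illustration; they are not taken from the paper.

```python
from collections import defaultdict


def split_by_variable(terms, i):
    """Rewrite a multivariate polynomial with respect to variable i.

    terms maps exponent tuples (e_1, ..., e_d) to coefficients. The
    result maps the exponents of the other d-1 variables to a
    univariate polynomial in x_i, stored as {exponent: coefficient}.
    """
    groups = defaultdict(dict)
    for exps, coeff in terms.items():
        rest = exps[:i] + exps[i + 1:]      # exponents of the other variables
        uni = groups[rest]
        uni[exps[i]] = uni.get(exps[i], 0) + coeff
    return dict(groups)


# Hypothetical f(x, y) = x^2*y + 3*x*y + y + 2*x^2
f = {(2, 1): 1, (1, 1): 3, (0, 1): 1, (2, 0): 2}
all_sub_poly_x = split_by_variable(f, 0)   # f = y*(x^2 + 3x + 1) + 2x^2
```

Each value of the returned dictionary corresponds to one member of $AllSubPoly_{x_i}$.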


Example 2: Suppose fifey) = xy” Sx"y4 2x’y 2+ 10x°y- 
x*y-5x4 ‘yt yx -x ee 5x ey t2xy + 2x" y e+ 10x ‘y-. xy” “Sy, and 
Soy) =x°y +x4y* 2x4 + 2x8y?-2x4y try’ tay! + xy -2xy. 


Then f; based on the variable y is represented as 


fi, x(y’ +5y)4 bx (2y"4 10y)+x4(-y’-5y) +x (yy 
2y’+10y)+x(-y -5y), 


so AllSubPoly, = { fiv= Sy, fol) =2y" tay" +10y, 
Ad=y'y-Sy, fio) =-y -5y, fob) =2y +10, fol =V + Sy}. 


And /; based on the variable x is represented as 


?_Sy) +x" (2y*+ 


+ 2x x43 +2x7-x) ty(5x°+100°-5x7- 


F427) +y?0* 
4 ae we 


so $AllSubPoly_x = \{f_1(x) = 5x^6 + 10x^5 - 5x^4 - 5x^3 + 10x^2 - 5x,\ f_2(x) = x^6 + 2x^5 - x^4 - x^3 + 2x^2 - x,\ f_3(x) = x^3 + 2x^2\}$.
$f_2$ based on the variable $y$ is represented as
A= XO) +4" Dy" + dy"-2y) +x (y') +x" +2y"-2y), 
so AllSubPoly, = AllSubPolyy U{y,y*-2y' +2y-2y,y'+2y° 
-2y}. 
And f> based on the variable x is represented as 
Ar= Vi Grit) ty O°-ax" 4x7) ty" (2x"+ 2x) +y(-2x"-2x), 


so $AllSubPoly_x = AllSubPoly_x \cup \{x^4 + x,\ x^6 - 2x^4 + x^2,\ 2x^4 + 2x,\ -2x^4 - 2x\}$.


Step 2: The univariate functional decomposition is computed
for each member of $AllSubPoly_{x_i}$ ($1 \le i \le d$)
(i.e., each
$f_{e_1, \dots, e_{i-1}, e_{i+1}, \dots, e_d}(x_i)$) by using
the univariate functional decomposition algorithm explained
below.


Univariate functional decomposition algorithm: Let $g$ and
$h$ be polynomials of degrees $r$ and $s$ over a field. Their
functional composition $f = g \circ h = g(h)$ has degree
$n = r \times s$. The univariate functional decomposition
problem can be stated as follows: given $f$ of degree
$n = r \times s$, determine whether such $g$ and $h$ exist,
and in the affirmative case, compute them [13].


The pseudo-code of the univariate functional decomposition
algorithm [14], which is slightly modified in our method to
also calculate the indecomposable part of an input
polynomial, is shown in Fig. 2. For every pair of values $r$
and $s$ for which $r \times s = n$, the UniDec procedure in
Fig. 2, with $f$ and $r$ as inputs, calculates a univariate
functional decomposition for $f$ as (3), where $f_0$ is the
indecomposable part of $f$.


$$f(x) = g(x) \circ h(x) + f_0 = g(h(x)) + f_0 \qquad (3)$$
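
The candidate degree pairs $(r, s)$ that UniDec is tried with can be enumerated as in the following Python sketch; the helper name `degree_splits` is hypothetical.

```python
def degree_splits(n):
    """All (r, s) with r * s == n and r, s > 1."""
    return [(r, n // r) for r in range(2, n) if n % r == 0]


splits = degree_splits(6)   # [(2, 3), (3, 2)]
```

A polynomial of prime degree yields an empty list, so it has no non-trivial split of this kind.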


As explained in [14], $f$, $g$ and $h$ are in the following
forms: $f = x^n + a_{n-1} x^{n-1} + \dots + a_0$,
$h = x^s + c_{s-1} x^{s-1} + \dots + c_1 x$, and
$g = x^r + b_{r-1} x^{r-1} + \dots + b_0$, respectively. In
this algorithm, first, the coefficients of $h$, i.e.,
$(c_1, \dots, c_{s-1})$, are calculated from the coefficients
of $f$ by the h_UniDec procedure (lines 4-11 in Fig. 2). For
this purpose, $q_k$ is defined as follows:

$$q_k = x^s + c_{s-1} x^{s-1} + \dots + c_{s-k} x^{s-k}, \quad 0 \le k \le s.$$

Then $q_0 = x^s$, $q_s = q_{s-1} = h$, and
$q_k = q_{k-1} + c_{s-k} x^{s-k}$, $1 \le k \le s$.


According to [14], we can calculate the first $k+1$
coefficients of $h^r$ from the coefficients of $q_k$. The
$(k+1)$-th coefficient of $q_k^r$ is the coefficient of
$x^{rs-k}$; this agrees with $a_{rs-k}$, i.e., the $(k+1)$-th
coefficient of $f$, $1 \le k \le s-1$. Thus, if the earlier
coefficients $c_{s-1}, \dots, c_{s-k+1}$ of $h$ are known,
then $c_{s-k}$ can be determined by computing

$$c_{s-k} = \frac{a_{rs-k} - d_k}{r}, \quad 1 \le k \le s-1,$$

where $d_k$ is the coefficient of $x^{rs-k}$ in $q_{k-1}^r$
[14].


Second, from $f$ and $h$, the coefficients of $g$, i.e.,
$(b_0, \dots, b_{r-1})$, are calculated by the g_UniDec
procedure (lines 12-16 in Fig. 2). Let $A[i, j]$ be the
coefficient of $x^{is}$ in $h^j$, $0 \le i, j \le r$. Then
$b = (b_0, \dots, b_r)$, with $b_r = 1$ because $g$ is monic,
can be determined by solving the following equation:

$$A b = a,$$

where $a = (a_0, a_s, \dots, a_{rs})$ are the coefficients of
$f$.


Then, the composition of $h$ and $g$ is computed by using the
function subs, which is a library function of Maple [15] and
computes the value of $g \circ h$. The difference between $f$
and $g \circ h$ is considered as the indecomposable part of
$f$ and is referred to as $f_0$ (line 15 in Fig. 2).


UniDec(f::polynom, r::integer)
 1: h := h_UniDec(f, r);
 2: (g, f0) := g_UniDec(f, h, r);
 3: return (h, g, f0);


h_UniDec(f::polynom, r::integer)
 4: q_0^i := x^(i·s), 0 ≤ i ≤ r, where s = degree(f)/r
 5: for k from 1 to s-1
 6:    d_k := coef(q_(k-1)^r, x^(degree(f)-k));
 7:    c_(s-k) := (a_(degree(f)-k) - d_k)/r;
 8:    q_k^0 := q_(k-1)^0;
 9:    q_k^j := Σ_{i=0..j} C(j,i) × (c_(s-k))^i × x^(i(s-k)) × q_(k-1)^(j-i), 1 ≤ j ≤ r;
10: h := q_(s-1)^1;
11: return h;


g_UniDec(f::polynom, h::polynom, r::integer)
12: A(i+1, j+1) := coef(h^j, x^(i·s)), 0 ≤ i, j ≤ r;
13: b := LinearSolve(A, a);
14: g := Σ_{i=0..r} b_i × x^i;
15: f0 := f - subs({x = h}, g);
16: return (g, f0);


Figure 2: Univariate functional decomposition algorithm


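Example 3: Let us consider a member of $AllSubPoly_x$ in
Example 2: $f = x^6 + 2x^5 - x^4 - x^3 + 2x^2 - x$. Because
$n = 6$, one of the cases in which $r, s > 1$ is $r = 2$,
$s = 3$. So $h$ and $g$ are in the following forms.

$$h = x^3 + c_2 x^2 + c_1 x, \quad g = x^2 + b_1 x + b_0.$$

The steps of the univariate decomposition algorithm are as
follows.

$$q_0^0 = 1, \quad q_0^1 = x^3, \quad q_0^2 = x^6$$

Step 1 ($k = 1$):

$$d_1 = coef(q_0^2, x^{6-1}) = 0, \quad c_2 = \frac{2 - 0}{2} = 1$$
$$q_1^0 = 1, \quad q_1^1 = x^3 + x^2, \quad q_1^2 = x^6 + 2x^5 + x^4$$

Step 2 ($k = 2$):

$$d_2 = coef(q_1^2, x^{6-2}) = 1, \quad c_1 = \frac{-1 - 1}{2} = -1$$
$$q_2^0 = 1, \quad q_2^1 = x^3 + x^2 - x, \quad q_2^2 = x^6 + 2x^5 - x^4 - 2x^3 + x^2$$

So $h$ is obtained as $h = x^3 + x^2 - x$.


Then, by using the coefficients of $f$ and $h$, the
coefficients of $g$ are calculated as follows.

$$A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & -2 \\ 0 & 0 & 1 \end{pmatrix}, \quad a = \begin{pmatrix} 0 \\ -1 \\ 1 \end{pmatrix}, \quad b = \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix}$$

So $g$ is obtained as $g = x^2 + x$.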
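
The decomposition steps above can be reproduced with a minimal Python sketch. It is not the Maple implementation used in this work: it assumes that $f$ is monic, that its degree is divisible by $r$, and that polynomials are stored as low-to-high coefficient lists; all function names are hypothetical. Since $A$ is upper triangular with a unit diagonal, the sketch uses back substitution instead of a general linear solver.

```python
from fractions import Fraction


def poly_mul(p, q):
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def poly_pow(p, e):
    out = [Fraction(1)]
    for _ in range(e):
        out = poly_mul(out, p)
    return out


def poly_compose(g, h):
    """Coefficients of g(h(x)), built with Horner's rule on g."""
    out = [Fraction(g[-1])]
    for coeff in reversed(g[:-1]):
        out = poly_mul(out, h)
        out[0] += coeff
    return out


def uni_dec(f, r):
    """Return (h, g, f0) with f = g(h(x)) + f0 for a monic f."""
    n = len(f) - 1
    s = n // r
    # h_UniDec: recover c_{s-1}, ..., c_1 one coefficient at a time
    q = [Fraction(0)] * s + [Fraction(1)]            # q_0 = x^s
    for k in range(1, s):
        d_k = poly_pow(q, r)[n - k]                  # coef of x^(n-k) in q_{k-1}^r
        q[s - k] = (Fraction(f[n - k]) - d_k) / r
    h = q
    # g_UniDec: back substitution on the triangular system A b = a
    powers = [poly_pow(h, j) for j in range(r + 1)]
    b = [Fraction(0)] * (r + 1)
    for i in range(r, -1, -1):
        acc = Fraction(f[i * s])
        for j in range(i + 1, r + 1):
            acc -= powers[j][i * s] * b[j]
        b[i] = acc
    gh = poly_compose(b, h)
    f0 = [Fraction(f[i]) - gh[i] for i in range(n + 1)]  # indecomposable part
    return h, b, f0


# f = x^6 + 2x^5 - x^4 - x^3 + 2x^2 - x with r = 2
h, g, f0 = uni_dec([0, -1, 2, -1, -1, 2, 1], 2)
```

For this input the sketch returns $h = x^3 + x^2 - x$, $g = x^2 + x$ and a zero indecomposable part, matching Example 3.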


The general form resulting from applying the univariate
functional decomposition algorithm to each member of
$AllSubPoly_{x_i}$ ($1 \le i \le d$) is shown in (4).


$$f_{e_1, \dots, e_{i-1}, e_{i+1}, \dots, e_d}(x_i) = g(x_i) \circ h(x_i) + f_0(x_i), \quad f_{e_1, \dots, e_{i-1}, e_{i+1}, \dots, e_d}(x_i) \in AllSubPoly_{x_i} \ (1 \le i \le d) \qquad (4)$$


Step 3: In the third step of the first phase of the proposed
method, all obtained right decomposition factors $h(x_i)$ of
all members of $AllSubPoly_{x_i}$ are stored in a set named
$h\_set_{x_i}$ as good building blocks.


Example 4: Let us consider Example 2 again. By computing the
univariate functional decomposition of all members of
$AllSubPoly_x$ and $AllSubPoly_y$, $h\_set_x$ for variable
$x$ and $h\_set_y$ for variable $y$ are obtained as follows.


h_set,= OP +x7-x, x’ +x, xx, e\. 
ped 2 
h_sety= {y", y-Sy}. 


4.2. Common sub-expression extraction (Phase II)
The aim of the second phase of the proposed method is to
extract common sub-expressions between all given
multivariate polynomials, which is equivalent to extracting
common sub-expressions between their equivalent univariate
polynomials stored in $AllSubPoly_{x_i}$ ($1 \le i \le d$).


To extract common sub-expressions we make use of the
univariate functional decomposition algorithm, unlike other
works that utilize the algebraic division technique
[1][6][7]. By considering members of $h\_set_{x_i}$ as good
building blocks, we try to re-decompose all members of
$AllSubPoly_{x_i}$ by these building blocks and find common
sub-expressions between them.


By using the g_UniDec procedure described in subsection 4.1,
each
$f_{e_1, \dots, e_{i-1}, e_{i+1}, \dots, e_d}(x_i) \in AllSubPoly_{x_i}$
is assessed to determine whether a polynomial $g'$ can be
calculated from this polynomial and each member of
$h\_set_{x_i}$, as shown in (5),

$$f_{e_1, \dots, e_{i-1}, e_{i+1}, \dots, e_d}(x_i) = g'(x_i) \circ h'(x_i) + f'_0(x_i), \quad h'(x_i) \in h\_set_{x_i} \ (1 \le i \le d) \qquad (5)$$

where $g'(x_i)$ is a new left decomposition factor of
$f_{e_1, \dots, e_{i-1}, e_{i+1}, \dots, e_d}(x_i)$,
$f'_0(x_i)$ is a new indecomposable part of
$f_{e_1, \dots, e_{i-1}, e_{i+1}, \dots, e_d}(x_i)$, and
$h'(x_i)$ is a member of $h\_set_{x_i}$. Please note that
each member of $h\_set_{x_i}$ may belong to different members
of $AllSubPoly_{x_i}$.


To reduce the cost of the corresponding hardware
implementation of each polynomial, we make use of the
canonical representation of $f'_0(x_i)$ from $Z_{2^{n_i}}$ to
$Z_{2^m}$, which is explained in section 2. We then compare
the cost of the canonical form with that of the original form
of $f'_0(x_i)$ and select the lower-cost form.


Example 5: Let us consider two members of $AllSubPoly_x$ in
Example 2: $p_1 = x^6 + 2x^5 - x^4 - x^3 + 2x^2 - x$,
belonging to $f_1$, and $p_2 = x^6 - 2x^4 + x^2$, belonging
to $f_2$. Two members of $h\_set_x$ in Example 4 are
$h_1 = x^3 + x^2 - x$ and $h_2 = x^3 - x$, which are right
decomposition factors of $p_1$ and $p_2$, respectively. By
applying the common sub-expression phase to $p_1$ and $p_2$
over $Z_{2^3}$, the two obtained forms of these polynomials
are as follows.


Form 1:
$p_1 = (x^2 - x) \circ h_2 + f_0 = (x^2 - x) \circ (x^3 - x) + 2x^5 + x^4 + x^2 - 2x$,
canonical_form($f_0$) $= 2x^2$, so
$p_1 = (x^2 - x) \circ (x^3 - x) + 2x^2$ over $Z_{2^3}$,
$p_2 = x^2 \circ h_2 = x^2 \circ (x^3 - x)$.

Form 2:
$p_1 = (x^2 + x) \circ h_1 = (x^2 + x) \circ (x^3 + x^2 - x)$,
$p_2 = (x^2 + 2x) \circ h_1 + f_0 = (x^2 + 2x) \circ (x^3 + x^2 - x) - 2x^5 - x^4 - 2x^2 + 2x$,
canonical_form($f_0$) $= -3x^2$, so
$p_2 = (x^2 + 2x) \circ (x^3 + x^2 - x) - 3x^2$ over
$Z_{2^3}$.
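
Both forms can be checked exhaustively over $Z_{2^3}$ with a few lines of Python. The sketch assumes a 3-bit input $x$ and uses illustrative function names; it only verifies the identities and does not derive them.

```python
MOD = 8  # Z_{2^3}


def p1(x):
    return x**6 + 2 * x**5 - x**4 - x**3 + 2 * x**2 - x


def p2(x):
    return x**6 - 2 * x**4 + x**2


def form1(x):
    t = x**3 - x                          # shared block h2
    return t * t - t + 2 * x * x, t * t


def form2(x):
    t = x**3 + x**2 - x                   # shared block h1
    return t * t + t, t * t + 2 * t - 3 * x * x


def agrees(form):
    """True if the form matches (p1, p2) modulo 8 for every 3-bit x."""
    return all(
        (form(x)[0] - p1(x)) % MOD == 0 and (form(x)[1] - p2(x)) % MOD == 0
        for x in range(MOD)
    )


both_forms_valid = agrees(form1) and agrees(form2)
```

In each form the block `t` is computed once and reused by both polynomials, which is the common sub-expression that this phase looks for.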


Therefore, the result of the common sub-expression extraction
phase is a set of various decompositions of each
$f_{e_1, \dots, e_{i-1}, e_{i+1}, \dots, e_d}(x_i)$ based on
the different building blocks belonging to different members
of $AllSubPoly_{x_i}$.


4.3. Complete system-level optimization algorithm
The complete proposed system-level optimization algorithm is
explained in this subsection. The pseudo-code of the proposed
method is shown in Fig. 3. Lines 3-10 describe the first
phase of the proposed method, in which each input
multivariate polynomial is transformed to several univariate
polynomials which are stored in $AllSubPoly_{x_i}$
($1 \le i \le d$) (lines 3-6). Then the univariate functional
decomposition algorithm in Fig. 2 is applied to these
univariate polynomials, and all obtained right decomposition
factors are stored in $h\_set_{x_i}$ as good building blocks
(lines 7-10). Lines 11-16 describe the second phase, in which
each member of $AllSubPoly_{x_i}$ is re-decomposed by members
of $h\_set_{x_i}$ using the g_UniDec procedure in Fig. 2
(lines 11-14), and common sub-expressions are determined. The
canonical form from $Z_{2^{n_i}}$ to $Z_{2^m}$ is calculated
for $f'_0(x_i)$ in order to reduce the cost of the
implementation (lines 15-16). Finally, to select the form
with the smallest number of arithmetic operations, every
newly generated form of all members of $AllSubPoly_{x_i}$ is
evaluated by computing the cost_func function. This function
determines the number of arithmetic operations, such as
additions and multiplications, needed for implementation of
the given polynomials (lines 17-22).
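
A simple operation-count model for expanded polynomials is sketched below in Python. The paper does not spell out cost_func, so this model is an assumption: every monomial is computed by direct multiplication of its factors, a coefficient other than ±1 costs one extra multiplication, and the terms of each polynomial are summed one by one.

```python
def cost(terms):
    """(multiplications, additions) for one expanded polynomial.

    terms maps exponent tuples to integer coefficients.
    """
    mults = 0
    for exps, coeff in terms.items():
        factors = sum(exps) + (0 if abs(coeff) == 1 else 1)
        mults += max(factors - 1, 0)   # k factors need k-1 multiplications
    adds = len(terms) - 1
    return mults, adds


# Expanded polynomials of the motivational example in section 3,
# with exponent tuples (e_x, e_y)
f1 = {(4, 0): 1, (3, 0): 2, (2, 0): 1, (1, 3): 1, (1, 2): -3, (1, 1): 2}
f2 = {(6, 0): 1, (5, 0): 3, (4, 0): 3, (3, 0): 1, (2, 0): 1, (1, 0): 1}

total = tuple(a + b for a, b in zip(cost(f1), cost(f2)))
```

Under these assumptions `total` is `(32, 10)`, the same count as the 32 multiplications and 10 additions quoted for the original system in section 3.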


Optimization Algorithm (LP, d)
 1: LP := List of input polynomials (f_1, f_2, ..., f_p)
 2: d := Number of variables
 3: for j from 1 to p
 4:    for i from 1 to d
 5:       consider f_j as f_j = Σ_{e_1,...,e_(i-1),e_(i+1),...,e_d ≥ 0} x_1^(e_1) × ... × x_(i-1)^(e_(i-1)) × x_(i+1)^(e_(i+1)) × ... × x_d^(e_d) × f_{e_1,...,e_(i-1),e_(i+1),...,e_d}(x_i)
 6:       Add every f_{e_1,...,e_(i-1),e_(i+1),...,e_d}(x_i) to AllSubPoly_(x_i);
 7: for i = 1 to d do
 8:    for k from 1 to size(AllSubPoly_(x_i))
 9:       (h, g, f0) := UniDec(AllSubPoly_(x_i)[k], r);
10:       Add h to h_set_(x_i);
11: for i = 1 to d do
12:    for j from 1 to size(AllSubPoly_(x_i))
13:       for each h' of h_set_(x_i)
14:          (g', f'0) := g_UniDec(AllSubPoly_(x_i)[j], h', r);
15:          if (cost_func(canonical_form(f'0)) < cost_func(f'0))
16:             f'0 := canonical_form(f'0);
17: for j = 1 to p do
18:    f'_j := reconstruct f_j from AllSubPoly_(x_i);
19: for every combination of f'_j (transformation of the original f_j)
20:    if (cost_func(current combination) < min_cost) then
21:       minimum cost combination := current combination
22: return minimum cost combination


Figure 3: System-level optimization algorithm


Example 6: Let us consider the polynomial system in 
example 2. This polynomial system originally needs 115 
multiplications and 23 additions. By applying our proposed 
method, we get an implementation with only 16 
multiplications and 10 additions as shown below. 


hx, b= xl), B= tty 
i= tt +2) + yOt5)t(ts+ L) , 
A= (t7 +x)(t(2tty)-2y)+ ytyty’. 
The results reported by GAUT for the datapath architectures 
of the original and optimized polynomials as sequential 
circuits, in speed optimization mode, are shown in Table 2. 


2 
ty=y, 


Table 2. GAUT report for the original and optimized
polynomials.


          | Cycles | Registers | Muxes | FU + | FU - | FU × | Area
Original  | 15     | 21        | 320   | 3    | …    | 12   | 1028
Optimized | 8      | 12        | 224   | 2    | 1    | 4    | 356

5. Experimental Results 

In order to show the effectiveness of our proposed 
optimization method, we have employed different 
polynomials extracted from real embedded systems. Various 
combinations of multivariate cosine wavelet (MVCS) for 
graphic applications [7], Savitzky-Golay (SG) filters [16] and 
digital image rejection unit (DIRU) for image processing 


applications, Quadratic filters (Quad) for DSP applications 
[17], Phase-Shift Keying (PSK) for digital communication 
[18] have been taken into account as multi-output 
polynomials. 


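Table 3. Comparison of the datapath architectures in the
speed optimization mode.


Method          | Metric    | DIRU, PSK, Quad | DIRU, PSK, SG2 | DIRU, Quad, SG2 | DIRU, MVCS, SG2 | %Δ
Horner form     | Cycles    | 17              | 17             | 16              | 16              | 0.00
                | Registers | 22              | 24             | 20              | 20              |
                | Muxes     | 384             | 352            | 324             | 336             |
                | FU +      | 2               | 2              | 3               | 4               |
                | FU -      | 1               | 1              | 0               | 0               |
                | FU ×      | 11              | 13             | 7               | 7               |
                | Area      | 937             | 1103           | 605             | 613             |
Method in [6]   | Cycles    | 12              | 15             | 17              | 13              | 13.42
                | Registers | 19              | 37             | 21              | 24              |
                | Muxes     | 272             | 368            | 306             | 304             |
                | FU +      | 3               | 3              | 3               | 4               |
                | FU -      | 1               | 1              | 0               | 0               |
                | FU ×      | 6               | 21             | 7               | 7               |
                | Area      | 530             | 1775           | 605             | 613             |
Method in [1]   | Cycles    | 15              | 15             | 11              | 13              | 18.32
                | Registers | 16              | 15             | 20              | 26              |
                | Muxes     | 243             | 256            | 304             | 288             |
                | FU +      | 2               | 2              | 4               | 4               |
                | FU -      | 1               | 1              | 0               | 0               |
                | FU ×      | 5               | 5              | 7               | 15              |
                | Area      | 439             | 439            | 613             | 1277            |
Proposed method | Cycles    | 13              | 13             | 11              | 11              | 27.39
                | Registers | 17              | 19             | 20              | 22              |
                | Muxes     | 320             | 336            | 288             | 272             |
                | FU +      | 2               | 2              | 2               | 3               |
                | FU -      | 1               | 1              | 0               | 0               |
                | FU ×      | 4               | 5              | 7               | 7               |
                | Area      | 356             | 439            | 597             | 605             |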

We have implemented the proposed method along with 
the methods in [1] and [6] in Maple [15], and then we have 
used GAUT as a high-level synthesis tool [12] to 
automatically generate datapath architectures for obtained 
polynomials. 


To generate datapath architectures based on optimized 
polynomials, we considered two modes provided by GAUT; 
speed optimization and area optimization. Table 3 illustrates 
the results obtained using our proposed method, Horner 
form, and methods in [1] and [6] for speed optimization 
mode. This table reports area and number of the clock cycles, 
registers, multiplexers, and functional units (adder, 
subtracter, multiplier) in the obtained datapath architectures. 
%Δ indicates the average percentage improvement in the number
of required clock cycles of each method compared to the Horner
form. The results in the table indicate that in our method
required clock cycles are reduced by an average of 38.11%, 
20.10%, and 12.24% in comparison with the Horner form, 
the method in [6] and the method in [1] across all 
benchmarks. This reduction indicates that our goal of 
reducing critical path delay has been achieved. Furthermore, 
area is improved in our method with an average improvement 
of 25.81% in comparison with other works. 
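
The %Δ entries of Table 3 can be reproduced as the mean of the per-benchmark cycle reductions relative to the first row group, as in the following Python sketch. The function name is illustrative, and the cycle counts are copied from the first, second and last row groups of the table.

```python
def average_improvement(baseline, candidate):
    """Mean percentage reduction of candidate relative to baseline."""
    gains = [(b - c) / b * 100 for b, c in zip(baseline, candidate)]
    return sum(gains) / len(gains)


first_group_cycles = [17, 17, 16, 16]    # %Δ = 0.00
second_group_cycles = [12, 15, 17, 13]   # %Δ = 13.42
last_group_cycles = [13, 13, 11, 11]     # %Δ = 27.39

delta_last = round(average_improvement(first_group_cycles, last_group_cycles), 2)
delta_second = round(average_improvement(first_group_cycles, second_group_cycles), 2)
```

Note that a benchmark on which a method needs more cycles than the baseline contributes a negative term to the mean.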


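Table 4. Comparison of the datapath architectures in the
area optimization mode.


Method          | Metric    | DIRU, PSK, Quad | DIRU, PSK, SG2 | DIRU, Quad, SG2 | DIRU, MVCS, SG2 | %Δ
Horner form     | Cycles    | 72              | 86             | 68              | 86              | 0.00
                | Registers | 13              | 15             | 13              | 16              |
                | Muxes     | 592             | 656            | 560             | 672             |
                | FU +      | 1               | 1              | 1               | 1               |
                | FU -      | 1               | 1              | 0               | 0               |
                | FU ×      | 1               | 1              | 1               | 1               |
                | Area      | 99              | 99             | 91              | 91              |
Method in [6]   | Cycles    | 52              | 91             | 72              | 71              | 8.38
                | Registers | 16              | 27             | 17              | 22              |
                | Muxes     | 672             | 1056           | 800             | 832             |
                | FU +      | 1               | 1              | 1               | 1               |
                | FU -      | 1               | 1              | 0               | 0               |
                | FU ×      | 1               | 1              | 1               | 1               |
                | Area      | 99              | 99             | 91              | 91              |
Method in [1]   | Cycles    | 36              | 36             | 55              | 65              | 37.92
                | Registers | 12              | 11             | 18              | 19              |
                | Muxes     | …               | 432            | 752             | 832             |
                | FU +      | 1               | 1              | 1               | 1               |
                | FU -      | 1               | 1              | 0               | 0               |
                | FU ×      | 1               | 1              | 1               | 1               |
                | Area      | 99              | 99             | 91              | 91              |
Proposed method | Cycles    | 41              | 48             | 49              | 52              | 38.68
                | Registers | 16              | 17             | 18              | 18              |
                | Muxes     | 624             | 704            | 720             | 752             |
                | FU +      | 1               | 1              | 1               | 1               |
                | FU -      | 1               | 1              | 1               | 1               |
                | FU ×      | 1               | 1              | 0               | 0               |
                | Area      | 99              | 99             | 91              | 91              |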

In the area optimization mode, only one functional unit is
considered for each operation type existing in the design
(i.e., all operations of the same type are bound to the same
functional unit). Table 4 illustrates the results obtained using
our proposed method, Horner form, and methods in [1] and 
[6] for area optimization mode. The results reported in the 
table indicate that our proposed method provides fewer 
required clock cycles in comparison with the Horner form, 
the method in [6] and the method in [1] with an average of 
64.73%, 49.97%, and 0.01%, respectively, across all 
benchmarks. Such improvement in the required clock cycles 
with a fixed number of functional units indicates that by 
using our method the number of operations would be fewer 
than those of other works. In other words, our proposed 
method can efficiently determine common sub-expressions. 


6. Conclusion

In this paper we have proposed a system-level optimization
method for datapaths implemented using a system of
polynomials. Our method optimizes polynomials to reduce the
complexity of polynomial datapaths in terms of the number of
arithmetic operations over $Z_{2^m}$. In the proposed method,
all given multivariate polynomials are first transformed to
several univariate polynomials. Then univariate functional
decompositions are calculated for them to obtain good
building blocks. To extract common sub-expressions we make
use of the univariate functional decomposition algorithm. We
have used the GAUT high-level synthesis tool to generate RTL
datapath architectures for the optimized polynomials as
sequential circuits. Experimental results show the
superiority of our approach in area and delay savings
compared with other related works. As future work, we are
going to utilize a multivariate functional decomposition
algorithm to extract better building blocks.